A Markov chain is a stochastic model in which the future depends only on the present, not on the past. Formally,
$$P(X_{t+1} \mid X_t, X_{t-1}, \dots, X_2, X_1) = P(X_{t+1} \mid X_t)$$
Transition Probability Matrix
Let us denote
$$p_{ij} = P(X_{n+1} = i \mid X_n = j)$$
where $p_{ij}$ denotes the probability of going from state $j$ to state $i$ in one step. Similarly, we can define $p_{ij}^{(n)}$ as the probability of going from state $j$ to state $i$ in $n$ steps, and we can collect these probabilities into the transition probability matrix.
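As a concrete sketch (the two-state matrix below is a made-up example, not part of any problem above), the n-step probabilities are just entries of the matrix power:

```python
import numpy as np

# Made-up two-state example. Following the convention above,
# p_ij = P(X_{n+1} = i | X_n = j): columns index the current state j,
# so each column sums to 1.
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# The n-step transition probabilities p_ij^(n) are the entries of P^n.
P3 = np.linalg.matrix_power(P, 3)
print(P3[:, 0])  # distribution after 3 steps starting from state 0
```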
MCMC has vast usage in statistics, mathematics, and computer science; here we will discuss a simple problem in Bayesian computation and assess why convergence of the Markov chain is important.
Let us assume that we want to estimate a parameter t(θ), where g(θ) is the prior density for θ and f(y∣θ) is the likelihood of y = y₁, y₂, …, yₙ given the value of θ. Then the posterior satisfies
g(θ∣y)∝g(θ)f(y∣θ)
which has to be normalized; the posterior density is then given by
$$g(\theta \mid y) = \frac{g(\theta)\, f(y \mid \theta)}{\int g(\theta)\, f(y \mid \theta)\, d\theta}$$
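Numerically, this normalization can be sketched on a grid; the Beta prior and binomial likelihood below are purely illustrative assumptions:

```python
import numpy as np

# Illustrative sketch: Beta(2, 2) prior, binomial likelihood (7 successes
# in 10 trials), posterior normalized numerically on a grid.
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)
dtheta = theta[1] - theta[0]
prior = theta * (1 - theta)              # unnormalized Beta(2, 2) density
likelihood = theta**7 * (1 - theta)**3   # binomial kernel
unnorm = prior * likelihood              # g(theta) f(y | theta)
posterior = unnorm / (unnorm.sum() * dtheta)  # divide by the integral
print(posterior.sum() * dtheta)  # integrates to ~1
```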
For the sake of simplicity let us assume t(θ) = θ, and let θ̂ be an estimate; then take the squared error loss function
L(θ,θ^)=(θ−θ^)2
Then the Classical Risk will be given by Rθ^(θ)=Eθ(L(θ,θ^)) and Bayes Risk is given by
r(θ^)=∫Rθ^(θ)g(θ)dθ
Now our target is to minimize the Bayes risk to obtain the Bayes estimate.
Writing the Bayes risk as an iterated integral over y and θ, it is minimized when the inner integral is minimized pointwise, which happens when
θ^=E(θ∣y)
Now, we may not always be able to calculate the mean of the posterior density, that is,
θ^=∫θg(θ∣y)dθ
When the posterior is known only up to a normalizing constant and the integral is intractable, we instead estimate θ by simulation: draw random samples from the posterior kernel g(θ∣y) and compute their sample mean, which converges to E(θ∣y) by the strong law of large numbers:
$$\hat\theta \approx \frac{1}{n} \sum_{i=1}^{n} \theta_i, \qquad \theta_i \sim g(\theta \mid y)$$
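The sample-mean idea can be sketched directly when we can draw from the posterior; the Beta(9, 4) posterior below is an assumed example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example: posterior g(theta | y) = Beta(9, 4), whose mean is 9/13.
samples = rng.beta(9, 4, size=100_000)
theta_hat = samples.mean()   # Monte Carlo estimate of E(theta | y)
print(theta_hat)             # close to 9/13 ≈ 0.6923
```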
Now let us write g(θ∣y) = π(θ); this is justified because y is realized, so g(θ∣y) is a function of θ alone. Here MCMC enters: if we can construct a chain whose stationary distribution is π(θ), then we can treat that chain as a sequence of samples converging to π(θ), and that is the reason we need the Markov chain to converge. Before we move forward, let us state some definitions.
Let π be a probability measure on (X,B) and Φ = {X₀, X₁, …} a discrete-time Markov chain on (X,B), with transition kernel P and transition density k, so that
P(x,A)=Pr(Xi+1∈A∣Xi=x)=∫Ak(x,y)dy
that is, P(x,A) gives the one-step transition probability from state x into any state in A. The transition kernel induces two linear operators:
λP where λ is probability distribution on (X,B)
Pf, where f is a non-negative measurable function on (X,B)
where
λP(A)=∫Xλ(x)P(x,A)dx
so if Xᵢ ∼ λ then λP(A) is the marginal probability that Xᵢ₊₁ ∈ A, and
Pf(x)=∫XP(x,dy)f(y)=Ep[f(Xi+1)∣Xi=x]
and m-step transition probability is given by
Pm(x,A)=∫Akm(x,y)dy
Invariant Density - π
π=πP⇒π(x)=∫Xπ(y)k(y,x)dy
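For a finite-state chain the invariant equation π = πP reduces to an eigenvector problem; a minimal sketch with a made-up two-state kernel:

```python
import numpy as np

# Made-up column-stochastic kernel: entry (i, j) = P(next = i | current = j).
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# The invariant distribution is the eigenvector of P with eigenvalue 1,
# normalized to sum to 1.
vals, vecs = np.linalg.eig(P)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()
print(pi)                        # [2/3, 1/3] for this kernel
assert np.allclose(P @ pi, pi)   # invariance check: pi = P pi
```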
Now, there are several ways to ensure that π is the invariant (or stationary) distribution; one of them is to satisfy the detailed balance condition, i.e.
$$\pi(x)\, k(x, y) = \pi(y)\, k(y, x) \quad \text{for all } x, y \in \mathcal X$$
However, the balance condition is only sufficient, not necessary: reversibility is not required for π to be invariant. If Xᵢ ∼ π and the chain preserves this distribution over any number of transitions, we say the Markov chain is stationary, and convergence to π is exactly what MCMC requires.
Let us Define
ϕ-irreducible A Markov chain is ϕ-irreducible for some measure ϕ on (X,B) if for all x ∈ X and A ∈ B for which ϕ(A) > 0, there exists n for which Pⁿ(x,A) > 0
A Chain is Aperiodic if Period is 1
Harris Recurrent A ϕ-irreducible Markov chain is Harris recurrent if, for every set A with ϕ(A) > 0, the chain reaches A with probability 1 from every starting point
Harris Ergodic A Markov chain is said to be Harris ergodic if it is ϕ-irreducible, aperiodic, Harris recurrent, and possesses an invariant distribution π, for some measure ϕ and distribution π
Total Variation Distance The total variation distance between two measures μ(·) and ν(·) is defined by
$$||\mu(\cdot) - \nu(\cdot)|| = \sup_{A \in \mathcal B} |\mu(A) - \nu(A)|$$
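On a finite state space the supremum over sets equals half the L1 distance between the probability vectors, which gives a short sketch:

```python
import numpy as np

def tv_distance(mu, nu):
    # sup_A |mu(A) - nu(A)| over subsets A equals half the L1 distance,
    # attained at A = {x : mu(x) > nu(x)}.
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

print(tv_distance([0.5, 0.5], [0.9, 0.1]))  # 0.4
```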
What does Harris Ergodicity Guarantee?
The chain is guaranteed to explore the entire space without getting stuck
Strong consistency of Markov chain averages
Convergence of the Markov chain to stationarity in total variation distance
The following two theorems are very important for MCMC
Ergodic Theorem Suppose the Markov chain Φ is Harris ergodic with invariant distribution π and Eπ|g(X)| < ∞ for some function g : X → ℝ. Then, for any starting value x ∈ X,
$$\bar g_n = \frac{1}{n} \sum_{i=0}^{n-1} g(X_i) \to E_\pi\, g(X) \quad \text{almost surely as } n \to \infty$$
and that is the main requirement that we use generally in MCMC
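A minimal sketch of this requirement in action, using a random-walk Metropolis chain (an assumed example, not described above) whose stationary distribution is the standard normal; the ergodic average of g(X) = X approaches E_π X = 0:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    return -0.5 * x * x  # log of an unnormalized N(0, 1) density

x = 0.0
chain = np.empty(50_000)
for i in range(chain.size):
    prop = x + rng.normal()  # symmetric random-walk proposal
    # Metropolis acceptance: accept with probability min(1, pi(prop)/pi(x))
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x = prop
    chain[i] = x

print(chain.mean())  # ergodic average, close to E_pi[X] = 0
```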
Birkhoff, George D. “Proof of the Ergodic Theorem.” Proceedings of the National Academy of Sciences of the United States of America, vol. 17, no. 12, 1931, pp. 656–660. JSTOR, www.jstor.org/stable/86016. Accessed 9 Apr. 2021.
The other theorem is as follows:
Suppose the Markov chain Φ is Harris ergodic with invariant distribution π. Then for any starting value x ∈ X, Φ will converge to π in total variation distance, i.e.
$$||P^n(x, \cdot) - \pi(\cdot)|| \to 0 \quad \text{as } n \to \infty$$
further, ||Pⁿ(x,·) − π(·)|| is monotonically non-increasing in n
Rate of Convergence
The Ergodic Theorem tells us that the Markov chain converges, but it says nothing about the rate of convergence. We call a Markov chain that converges at a geometric rate geometrically ergodic, i.e. there exist M : X → ℝ and some constant t ∈ (0,1) that satisfy
$$||P^n(x, \cdot) - \pi(\cdot)|| \le M(x)\, t^n \quad \text{for any } x \in \mathcal X$$
If M is bounded, the Markov chain is uniformly ergodic
As long as the starting value x is such that M(x) is not large, geometric ergodicity guarantees quick convergence of the Markov chain
Geometric ergodicity holds for every irreducible and aperiodic Markov chain on a finite state space
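This can be seen numerically on the same kind of made-up two-state kernel as before: the total variation distance to π shrinks by a constant factor (the second eigenvalue, 0.7 here) at every step:

```python
import numpy as np

P = np.array([[0.9, 0.2],     # column-stochastic two-state kernel
              [0.1, 0.8]])
pi = np.array([2 / 3, 1 / 3])  # its invariant distribution
dist = np.array([1.0, 0.0])    # start as a point mass at state 0

tvs = []
for n in range(10):
    dist = P @ dist                       # distribution after one more step
    tvs.append(0.5 * np.abs(dist - pi).sum())

ratios = [tvs[i + 1] / tvs[i] for i in range(len(tvs) - 1)]
print(ratios)  # each ratio is 0.7: geometric decay ||P^n(x,.) - pi|| ~ t^n
```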
What is Needed for Geometric Ergodicity
Drift and Minorization Condition
A Type I drift condition holds if there exist some non-negative function V : X → ℝ≥0 and constants 0 < γ < 1 and L < ∞ such that
$$PV(x) \le \gamma V(x) + L \quad \text{for any } x \in \mathcal X$$
Further, we call V a drift function and γ the drift rate
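As a sketch, consider the AR(1) chain X_{n+1} = aX_n + ε with ε ∼ N(0, 1) (an assumed example): taking V(x) = x² gives PV(x) = a²x² + 1 exactly, so a Type I drift condition holds with γ = a² and L = 1. A quick Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(2)
a = 0.5                   # AR(1) coefficient, |a| < 1
gamma, L = a * a, 1.0     # drift rate and constant for V(x) = x^2

for x in [-10.0, -1.0, 0.0, 3.0]:
    nxt = a * x + rng.normal(size=200_000)   # one transition from x
    PV = (nxt ** 2).mean()                   # Monte Carlo E[V(X_{n+1}) | x]
    # Drift bound PV(x) <= gamma V(x) + L (0.2 slack for Monte Carlo noise)
    assert PV <= gamma * x * x + L + 0.2
print("Type I drift condition verified numerically")
```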
A minorization condition holds on a set C ∈ B if there exist some positive integer m, ϵ > 0, and a probability measure Q on (X,B) for which
$$P^m(x, A) \ge \epsilon\, Q(A) \quad \text{for all } x \in C \text{ and } A \in \mathcal B$$
We also call this the m-step minorization condition, and C is called a small set. In terms of densities it implies
$$k^m(x, y) \ge \epsilon\, q(y)$$
where q is the density of Q.
Proposition
Suppose the Markov chain Φ is irreducible and aperiodic with invariant distribution π. Then Φ is geometrically ergodic if the following two conditions are met:
A Type I drift condition holds
There exists some constant d > 2L/(1−γ) for which a one-step minorization condition holds on the set C = {x : V(x) ≤ d}
This Proposition is a corollary of Rosenthal (1995a):
Let Φ be an aperiodic and irreducible Markov chain with invariant distribution π
Suppose Conditions 1 & 2 of the Proposition hold, let X₀ = x₀ be the starting value, and consider Rosenthal's resulting bound on ||Pⁿ(x₀,·) − π(·)||.
Rearranging that bound shows that it satisfies the geometric ergodicity condition.
V(x) + 1 is proportional to M(x); hence the starting point should be chosen so that V(x) is small
Type II Drift Condition: there exist some function W : X → [1, ∞) finite at some x ∈ X, some set D ∈ B, and constants 0 < ρ < 1 and b < ∞ for which
$$PW(x) \le \rho W(x) + b\, I_D(x) \quad \text{for all } x \in \mathcal X$$
It is easy to show that Type I Drift Condition ⇐⇒ Type II Drift Condition
Finally we can say that
Suppose the Markov chain Φ is aperiodic and ϕ-irreducible with invariant distribution π. Then Φ is geometrically ergodic if there exist some small set D, a drift function W : X → [1, ∞), and constants 0 < ρ < 1 and b < ∞ for which a Type II drift condition holds
Now let me restate the earlier theorem:
Suppose the Markov chain Φ is Harris ergodic with invariant distribution π. Then for any starting value x ∈ X, Φ will converge to π in total variation distance, i.e.
$$||P^n(x, \cdot) - \pi(\cdot)|| \to 0 \quad \text{as } n \to \infty$$
further, ||Pⁿ(x,·) − π(·)|| is monotonically non-increasing in n
Jain and Jamison (1967) have shown that for every ϕ-irreducible Markov chain on (X,B) there exists some small set C ∈ B for which ϕ(C) > 0. Furthermore, the corresponding minorization measure Q(·) can be defined so that Q(C) > 0.
This result allows us to assume a set C ∈ B such that
$$P(x, A) \ge \epsilon\, Q(A) \quad \text{for all } x \in C$$
That is the one-step minorization condition. Now we can write
$$P(x, A) = \epsilon\, Q(A) + (1 - \epsilon)\, R(x, A) \quad \text{for all } x \in C \text{ and } A \in \mathcal B$$
Here R(x,·) is a probability measure on (X,B). This decomposition allows us to construct two separate chains which couple with probability 1:
$$\Phi^{(X)} = \{X_0, X_1, \dots\}, \qquad \Phi^{(Y)} = \{Y_0, Y_1, \dots\}$$
Now (Xn, Yn) → (Xn+1, Yn+1) by the following algorithm:
While Xn ≠ Yn:
If (Xn, Yn) ∉ C × C, draw Xn+1 ∼ P(Xn, ·) and Yn+1 ∼ P(Yn, ·) independently
If (Xn, Yn) ∈ C × C, draw δn ∼ Bern(ϵ):
If δn = 0, draw Xn+1 ∼ R(Xn, ·) and Yn+1 ∼ R(Yn, ·) independently
Otherwise, draw Xn+1 = Yn+1 ∼ Q(·)
Once Xn = Yn = x, draw Xn+1 = Yn+1 ∼ P(x, ·)
Now define the coupling time T as the first n for which (Xn−1, Yn−1) ∈ C × C and δn−1 = 1; once the chains couple, they remain equal.
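The construction can be sketched on a made-up two-state chain in which C is the whole space, so the minorization split applies at every step (ε, Q, and the kernel are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Row convention here: P[x] is the distribution of the next state given x.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
# Minorization P(x, .) >= eps * Q(.): the column-wise minima are 0.2 and 0.1,
# so eps = 0.3 and Q = (2/3, 1/3).
eps = 0.3
Q = np.array([2 / 3, 1 / 3])
R = (P - eps * Q) / (1 - eps)  # residual kernel; rows still sum to 1

x, y, t = 0, 1, 0
while x != y:
    if rng.uniform() < eps:               # delta_n = 1: chains couple via Q
        x = y = rng.choice(2, p=Q)
    else:                                 # delta_n = 0: move independently via R
        x, y = rng.choice(2, p=R[x]), rng.choice(2, p=R[y])
    t += 1
print("chains coupled at step", t)        # T ~ Geometric(eps) in this example
```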
Now let us assume
$$X_0 = x \quad \text{and} \quad Y_0 \sim \pi$$
and let Pr_x denote probability with respect to the starting point x; then Φ^(Y) is stationary.
However, this alone does not suffice for convergence. Aperiodicity is needed to ensure the samples do not cycle, so the chain explores the whole space, and irreducibility confirms that it will not get stuck. If we can verify the balance condition, then we are assured the chain will converge. Let Φᵢ = {θᵢ₀, θᵢ₁, …} and let k₁(θ̃₁, θ₁) be the transition density of Φᵢ; then